ABSTRACT


We  propose  a  data  migration  mechanism  that  allows  an  explicit  and  controlled  mapping 
of  data  to  memory.  While  read  or  write  copies  of  each  data  element  can  be  assigned  to 
any  processor’s  memory,  longer  term  storage  of  each  data  element  is  assigned  to  a  specific 
location  in  the  memory  of  a  particular  processor.  We  present  data  that  suggests  that  the 
scheme  may  be  a  practical  method  for  efficiently  supporting  data  migration. 
1  Introduction


It  is  well  known  that  data  distribution  and  load  balance  play  critical  roles  in  determining 
the  performance  one  can  expect  to  obtain  from  distributed  machines.  Data  must  be 
moved  from  processor  to  processor  in  response  to  computational  demands.  One  way 
of  supporting  data  migration  is  to  explicitly  designate  blocks  of  data  that  are  to  be 
prefetched  into  the  memory  of  a  given  processor  and  to  copy  the  data  into  customized 
data  structures.  Programs  are  written  on  each  processor  with  intimate  knowledge  of 
the  format  used  to  store  off-processor  data.  For  some  problems,  this  approach  to  data 
distribution  can  be  extremely  efficient.  However,  programming  in  this  manner  can  be 
very  time  consuming,  and  can  lead  to  programs  that  are  difficult  to  debug.  In  this  paper 
we  describe  a  scheme  which  allows  data  to  be  automatically  moved  to  the  processor 
that  needs  it.  In  the  remainder  of  this  section,  we  provide  an  overview  of  our  data 
migration  scheme.  Section  2  gives  details  of  the  hashed  cache  used  in  this  scheme.  The 
remainder  of  this  paper  describes  the  experiments  that  we  have  performed.  Section  3 
studies  the  overheads  due  to  accessing  the  hashed  cache.  Section  4  outlines  the  results 
we  have  obtained  using  the  model  problem  of  sweeping  over  an  unstructured  mesh.  These 
results  were  obtained  with  a  synthetic  workload  along  with  an  unstructured  mesh  used  in 
aeronautical  simulations.  We  finally  present  our  conclusions  in  Section  5. 


Overview  of  the  Data  Migration  Scheme 

In our scheme, data structures can be distributed across the set of processors. Such
distribution of data structures across processors is supported in languages and systems
such as [2, 3, 4, 8, 10, 11].

While  read  or  write  copies  of  each  data  element  can  be  assigned  to  any  processor’s 
memory,  longer  term  storage  of  each  data  element  is  assigned  to  a  specific  location  in 
the  memory  of  a  particular  processor.  A  processor  needing  to  read  or  write  an  array 
element gets a copy of that element. In a distributed memory machine, run-time
procedures carry out the actual movement of data. Local copies of data are stored in a hash
table. The hashing scheme for storing off-processor data in distributed machines will be
called a hashed cache. There are many possible ways of organizing such a hash table; we
experimentally examine a straightforward scheme for storing and retrieving off-processor
data.

It  is  not  realistic  to  expect  that  all  loops  can  be  parallelized  or  that  it  will  be  feasible 
to  always  map  arrays  so  that  computations  taking  place  on  a  given  processor  use  mostly 
local  data.  We  must  be  able  to  handle  the  situation  in  which  a  single  processor  must  access 
large  numbers  of  off-processor  array  elements.  This  can  happen  either  in  a  sequential  code 
that  executes  on  a  single  processor  of  a  distributed  machine  or  in  a  given  processor’s  part 
of  a  parallelized  loop  computation.  The  hashed  cache  provides  a  mechanism  for  memory 
management  on  a  distributed  memory  machine. 
Coherency between the distributed data structures and the hashed caches must be
maintained in software. It is particularly easy for compilers to generate code that
guarantees coherency for (nested) parallel loops or sequential code. Although the hashed cache
mechanism  can  be  used  in  more  general  contexts,  we  will  limit  the  discussion  that  follows 
to  those  two  cases. 

A parallel loop is transformed into two parts by a compiler. One part is called an
inspector, the other is called an executor. The inspector is responsible for determining
what data elements are required by a loop; the executor carries out the actual
computations. Note that an inspector may have to be applied incrementally to subsets of the
loop iterations when the amount of off-processor data to be stored exceeds the memory
allocated to the hashed cache. The idea of splitting a parallel loop into an inspector and
executor appeared in [5, 7, 9, 12]. In the context of sparse matrix computations, the idea
was advanced in [1].

2  The  Hashed  Cache 

We  presume  here  that  arrays  have  been  distributed  based  on  user  annotations  or  directives. 
Once  a  distributed  array  is  initialized,  one  can  use  the  specified  partitioning  information  to 
find,  for  any  distributed  array  element,  a  unique  processor  P  along  with  a  unique  location 
in  that  processor  P’s  storage.  In  each  processor,  contiguous  memory  locations  are  used  to 
store  local  elements  of  a  given  distributed  array.  The  unique  location  in  P’s  storage  can 
thus be expressed as an integer offset O.

The  program  in  Figure  1  will  be  used  as  a  running  example  in  this  discussion.  This 
program  performs  a  sequence  of  matrix  vector  multiplications.  In  order  to  compute  y(i) 
at each iteration, we need yold(nbrs(i, j)). At the end of each matrix vector
multiplication in loop S5, the newly computed values y(i) are copied back into yold(i).

Both  arrays  y  and  yold  are  distributed.  When  the  loop  is  distributed,  loop  iterations 
may  be  assigned  to  processors  in  an  arbitrary  fashion.  Consequently,  long  term  storage 
of  elements  of  yold  may  not  be  assigned  to  the  processors  that  execute  code  referring  to 
those  elements.  Due  to  the  structure  of  the  code,  it  is  straightforward  in  this  case  to  make 
sure  that  array  y  is  distributed  so  that  each  processor  accesses  only  local  elements  of  y. 

The  global  arrays  are  initialized  at  the  start  of  the  program.  We  proceed  to  describe 
the  primitives  that  support  the  inspector  and  executor  phases  of  the  hashed-cache  system. 
To  best  understand  the  details  of  the  inspector  and  executor  phases  we  describe  them  in 
the  context  of  the  example  presented  in  Figure  1. 

2.1  The  Inspector  Phase 
S1  do iter=1,num
S2     do i=1,n**2
S3        do j=0,m
S4           y(i) += values(i,j)*yold(nbrs(i,j));
          end do
       end do
S5     do k=1,n**2
          yold(k) = y(k);
       end do
    end do

Figure 1: Sparse Matrix Vector Multiply

Figure  2  depicts  the  pseudo-code  of  the  inspector  phase  for  the  sparse  matrix  vector 
multiply.  During  the  inspector  phase  we  go  through  the  inner-loop  once  to  check  for  local 
and  non-local  global  array  accesses.  If  an  array  reference  is  local  we  do  nothing.  However, 
if  it  is  a  non-local  reference  to  a  global  array,  we  compute  the  processor  on  which  the 
element  resides  and  its  offset.  We  need  to  store  this  information  in  such  a  way  that 
accessing  it  is  efficient.  This  is  achieved  by  using  a  hashed  cache  scheme. 

The location of a non-local distributed array element is determined by a hash function.
Currently, we use a hashing function that, for a hash table of size 2^k, simply masks the
lower k bits of the key. The key is formed by concatenating the processor-offset pair,
(P, O), that corresponds to a distributed array reference. A linked list is used in case of
collisions. To access the cache, the appropriate linked list must be traversed until the
correct key is found. Each entry in the hash table consists of the following:

1. a reference to the non-local data item, i.e., the data item's processor-offset pair,

2. whether the item is to be read (read flag),

3. whether the item is to be written (write flag),

4. the data value itself,

5. a pointer to the next data item that hashed to the same location.
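
As a minimal sketch of the table just described, the following Python models an entry with the five fields listed and a table of size 2^k indexed by the low k bits of the key. The names (Entry, HashedCache, make_key) and the choice of 16 offset bits in the key are illustrative assumptions, not details taken from the implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

OFFSET_BITS = 16  # assumed width of the offset field in the key


def make_key(proc: int, offset: int) -> int:
    """Concatenate the (P, O) pair into a single integer key."""
    return (proc << OFFSET_BITS) | offset


@dataclass
class Entry:
    key: int                         # 1. processor-offset pair, concatenated
    read: bool = False               # 2. read flag
    write: bool = False              # 3. write flag
    value: float = 0.0               # 4. the data value itself
    next: Optional["Entry"] = None   # 5. next entry in the same bucket


class HashedCache:
    def __init__(self, k: int) -> None:
        self.mask = (1 << k) - 1
        self.buckets: List[Optional[Entry]] = [None] * (1 << k)

    def find(self, key: int) -> Optional[Entry]:
        """Walk the bucket's linked list until the key matches."""
        e = self.buckets[key & self.mask]
        while e is not None and e.key != key:
            e = e.next
        return e

    def insert(self, key: int) -> Entry:
        """Return the entry for key, creating it at the list head if absent."""
        e = self.find(key)
        if e is None:
            b = key & self.mask
            e = Entry(key=key, next=self.buckets[b])
            self.buckets[b] = e
        return e
```

Two keys that differ only in the processor field land in the same bucket here, which is exactly the situation in which the linked list must be traversed.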


If the data item is a non-local read reference R, it is processed by the
process-global-read() routine. The routine is described as follows:
Loop over local iterations i assigned to P
   do j = 0,m

      Compute processor, offset pair for element of yold

      If yold reference is to non-local array element,
         process-global-read()

   end do
End loop over local iterations i

Loop over local iterations k assigned to P

   If yold reference is to non-local array element,
      process-global-write()

End loop over local iterations k

Perform global communications to set up send and receive
pairs for the non-local data to be scattered and gathered

Figure 2: Inspector: Sparse Matrix Vector Multiply


process-global-read()

1.  Search  for  the  reference  R  in  the  hashed-cache. 

2.  If  R  exists  and  the  read  flag  is  set,  do  nothing. 

3.  If  R  exists  and  the  read  flag  is  not  set,  set  read  flag. 

4.  If  R  is  not  found  in  the  hashed  cache,  create  an  entry  with  read  flag  set  and  enter  it 
in  the  hashed-cache. 

5.  In  the  latter  two  situations,  increment  a  count  variable  that  contains  the  number 
of  non-local  elements  to  be  gathered  from  the  processor  P  on  which  this  element 
resides.  The  offset  of  this  element  is  inserted  in  a  list  containing  the  offsets  of  all 
the  elements  to  be  gathered  from  P. 
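
The five steps can be sketched as follows. For brevity the cache is a Python dict keyed by the (P, O) pair rather than the chained hash table described earlier; InspectorState, gather_count and gather_offsets are hypothetical names for the bookkeeping mentioned in step 5.

```python
from collections import defaultdict


class InspectorState:
    def __init__(self):
        self.cache = {}                          # (proc, offset) -> flags and value
        self.gather_count = defaultdict(int)     # proc -> number of elements to gather
        self.gather_offsets = defaultdict(list)  # proc -> offsets to gather


def process_global_read(state, proc, offset):
    entry = state.cache.get((proc, offset))      # step 1: search for R
    if entry is not None and entry["read"]:
        return                                   # step 2: already scheduled
    if entry is not None:
        entry["read"] = True                     # step 3: set the read flag
    else:                                        # step 4: create the entry
        state.cache[(proc, offset)] = {"read": True, "write": False, "value": None}
    state.gather_count[proc] += 1                # step 5: record the gather
    state.gather_offsets[proc].append(offset)
```

Repeated reads of the same element leave the count unchanged, so each off-processor element is requested at most once per gather.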

Non-local array references R that are written to are processed by the
process-global-write() routine. As was the case with process-global-read(), this routine
maintains a count variable containing the number of non-local elements to be scattered to P.
It also maintains a list of the offsets of all the elements R to be scattered to P. At the
completion of the
do iter=1,num

   process-gather-data() - obtain needed yold values from other
                           processors and put in hashed cache

   Loop over local iterations i assigned to P
      do j = 0,m
         Perform calculation reading yold values or writing y
         values using hashed cache or local memory as is appropriate.
      end do
   End loop over local iterations i

   Loop over local iterations k assigned to P
      Perform assignment reading y values or writing yold values
      using local memory or hashed cache as is appropriate.
   End loop over local iterations k

   process-scatter-data() - scatter yold values from hashed
                            cache to appropriate processors

end do

Figure 3: Hashed Cache Executor: Sparse Matrix Vector Multiply



inspector  phase  we  precompute  the  communication  pattern  required  to  efficiently  gather 
or  scatter  all  the  relevant  non-local  data  referenced  in  the  loop.  This  requires  a  global 
communication  phase  in  which  all  processors  participate.  For  a  detailed  description, 
see  [5,  12]. 


2.2  The  Executor  Phase 

Figure  3  depicts  the  pseudo-code  of  the  executor  phase  for  the  sparse  matrix  vector 
multiply.  The  non-local  data  required  by  the  inner  loop  is  first  obtained  from  other 
processors  and  stored  in  the  hashed-cache  by  the  process-gather-data  routine.  We  now 
proceed  to  execute  the  doall  loop. 

During the execution of the inner loop we check each distributed array reference to
decide whether it resides in the local array or not. If it does, we compute the offset of the
element in the local array and fetch the data item from the appropriate memory location.
If it does not, we fetch it from the hashed cache. If the array reference occurs on the
left-hand side and it is non-local, we enter the new value in the hashed cache. At the end
of the execution of the inner loop, each processor calls the process-scatter-data() routine.
This routine goes through the list of non-local offsets of elements to be scattered, searches
for these elements in the hashed cache and writes the values to a list containing the new
values to be written to the distributed memory. The data is then scattered to the distributed
memory.

The  operations  for  computing  processor  number  and  offset  are  computationally  very 
cheap  since  we  assume  the  distributed  array  may  be  partitioned  in  a  block  or  block  wrap 
fashion.  The  size  of  each  block  is  a  power  of  2  and  thus  we  need  to  perform  simple  integer 
operations  such  as  shifts  to  compute  the  offset  and  processor  number  of  a  distributed 
array  element. 
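
To illustrate why these translations are cheap, here is a sketch for a block size of 2^s. The block-wrap variant assumes that blocks are dealt to processors round-robin and that the number of processors is also a power of two, 2^t; the function names and these layout details are assumptions made for the example.

```python
def block_owner(g: int, s: int):
    """Block distribution: element g lives on processor g >> s."""
    return g >> s, g & ((1 << s) - 1)


def block_wrap_owner(g: int, s: int, t: int):
    """Block-wrap distribution over 2**t processors, blocks of size 2**s."""
    block = g >> s                       # global block number
    proc = block & ((1 << t) - 1)        # blocks dealt round-robin
    local_block = block >> t             # index of the block on that processor
    offset = (local_block << s) | (g & ((1 << s) - 1))
    return proc, offset
```

Both functions use only shifts, masks and a bitwise or, so no division or table lookup is needed per reference.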

2.3  Enumerated  Version  of  the  Hashed  Cache  System 

We  have  also  implemented  two  modifications  of  the  hashed  cache  system  to  allow  us 
to  quantify  the  cost  of  accessing  the  hashed  cache  and  of  transforming  references  to 
distributed  array  elements  to  locations  in  local  storage.  In  these  schemes,  the  non-local 
data  is  stored  in  the  hashed  cache  as  described  in  Section  2.  Along  with  the  hash  table,  we 
also  have  a  list  of  pointers  to  all  array  elements  referenced  during  the  loop  computation, 
on  processor  P.  This  enumeration  list  allows  us  to  retrieve  off-processor  array  elements 
during  the  executor  phase  without  going  through  the  hash  table.  Furthermore,  locally 
stored  elements  of  the  distributed  array  can  also  be  accessed  without  translating  a  global 
index  to  local  offset  in  P’s  storage.  We  call  this  the  full  enumeration  scheme.  Note  that 
the  enumeration  list  can  be  easily  generated  by  the  inspector  as  it  examines  all  the  array 
references  in  the  loop. 

The  fully  enumerated  version  of  the  hashed  cache  can  clearly  require  extremely  large 
amounts  of  storage  for  storing  its  array  of  pointers,  and  is  not  a  practical  alternative  for 
most  applications.  It  is  possible  instead  to  employ  an  array  of  pointers  that  point  only  to 
hash  table  locations  accessed.  In  this  case,  the  number  of  pointers  maintained  is  equal  to 
the  number  of  accesses  to  off-processor  array  references.  This  version  is  called  the  partial 
enumeration  scheme. 
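
A rough sketch of partial enumeration, assuming the inspector visits references in the same order the executor will. The dict stands in for the hashed cache, one-element lists stand in for pointers to cache locations, and all names are illustrative.

```python
def inspect(refs, my_proc):
    """Build the cache and the partial enumeration list for one processor.

    refs is the sequence of (proc, offset) pairs the loop will touch.
    """
    cache = {}        # stand-in for the hashed cache
    enum_list = []    # one pointer per off-processor access, in loop order
    for proc, offset in refs:
        if proc != my_proc:
            cell = cache.setdefault((proc, offset), [None])
            enum_list.append(cell)
    return cache, enum_list


def execute(refs, my_proc, local, enum_list):
    """Sum the referenced values, reading cached copies through enum_list."""
    total, nxt = 0.0, 0
    for proc, offset in refs:
        if proc == my_proc:
            total += local[offset]
        else:
            total += enum_list[nxt][0]   # pointer dereference, no hash lookup
            nxt += 1
    return total
```

The enumeration list holds one pointer per off-processor access, while the cache holds one cell per distinct off-processor element.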


3  Hash  Table  Overhead 

We  first  present  a  set  of  simple  experiments  to  illustrate  the  overheads  associated  with 
initializing and accessing the hashed cache. An array a is distributed between processors
in blocks of size b. A loop executing on processor 0 (Figure 4) accesses off-processor data
with varying strides. Recall from Section 2 that when multiple data elements hash to the
same location we store the overflow in a linked list. By varying the stride of this simple
Table 1: Inspector and Executor timings as a function of the loop stride

loop     max links   executor    executor    inspector   inspector   sequential
stride   traversed   comm time   comp time   comm time   comp time   time
                     (ms)        (ms)        (ms)        (ms)        (ms)

   1         0          30          82          10          49          20
   2         1          31          93          10          62          20
   4         3          19         112          11          86          20
   8         7          15         154          12         130          20
  16        15          14         238          15         217          20
  32        31          17         381          21         391          20
  31         0          18          82          22          49          20


loop,  we  are  able  to  study  the  effect  of  link  traversal  through  the  hashed  cache  on  the 
executor  and  inspector  times.  We  fix  the  following  parameters: 

•  hash  table  size  =  4096 


•  numiter  =  4096 


•  total  non-local  accesses  =  4096 


•  b  =  8192 


Table 1 reports the experimental results we obtained by running the loop in Figure 4
with varying strides, all but one of which were powers of two. These experiments were
carried out on a 32 node iPSC/2. With an odd stride, there were no hash table collisions
and hence no links needed to be traversed. This can be seen from the similar timings for
strides 1 and 31. With power of two strides, we can have up to stride - 1 links in a hash
bucket. The overhead of the executor increases as the possible number of links traversed
for each hash table access increases. This is reflected in the increase of the executor


if (my_processor .eq. 0)
   do j = b, numiter*stride + b, stride
      asum = asum + a(j)
   end do


Figure 4: Processor 0: Accessing non-local data
computation time as the number of links increases from 0 links to 31. The executor
communication  time,  however,  dropped  by  50  %.  This  is  because  with  a  block  size  of 
b  and  stride  1  all  the  4096  non-local  elements  accessed  by  the  above  loop  on  processor 
0,  reside  on  a  single  processor,  processor  1.  Thus  processor  1  has  to  send  all  the  4096 
array  elements  to  processor  0.  With  stride  8,  the  4096  elements  accessed  by  processor  0 
are  distributed  across  four  other  processors.  Thus,  these  four  processors  send  their  data 
concurrently  to  processor  0,  reducing  the  overall  executor  communication  time. 

As the stride increases, the number of processors from which data is needed increases.
This is reflected in the inspector communication timing, which doubles as the stride
goes from 1 to 32. The stride used also has an effect on the inspector computation time.
As noted above, an increase in the stride increases the maximum number of links in the
hash table, which in turn leads to an increase in the time required to access the hash table.
Since the inspector computation time is directly dependent on the time required to access
the hash table, we see a fairly rapid increase in this time with the increase in stride.
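
The link counts reported in Table 1 can be reproduced with a small simulation, under the assumption that the offset occupies the low-order bits of the concatenated key, so that the hash depends only on the offset within the owning block. The function name and the use of exactly numiter iterations are choices made for this sketch.

```python
from collections import Counter

TABLE_BITS = 12      # hash table size 4096 = 2**12
NUMITER = 4096
B = 8192             # block size b


def simulate(stride: int):
    """Return (max links traversed, number of source processors)."""
    chain = Counter()
    procs = set()
    for i in range(NUMITER):
        j = B + i * stride               # global index touched by the loop
        proc, offset = divmod(j, B)      # owner and offset under block distribution
        chain[offset & ((1 << TABLE_BITS) - 1)] += 1
        procs.add(proc)
    return max(chain.values()) - 1, len(procs)


for stride in (1, 2, 4, 8, 16, 32, 31):
    print(stride, simulate(stride))
```

Under these assumptions each power-of-two stride yields stride - 1 links and stride 31 yields none, in agreement with the measured column; stride 1 draws all of its data from one processor and stride 8 from four.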


4  Unstructured  Mesh  Results 

In  this  section  we  present  the  performance  of  the  hashed  cache  system  for  the  program 
depicted  in  Figure  1.  This  code  exhibits  greatly  varying  patterns  of  locality  depending  on 
how  loop  iterations  are  assigned  to  processors  and  the  contents  of  the  integer  array  nbrs. 
The  integer  array  nbrs  can  be  viewed  as  a  representation  of  a  sparse  or  unstructured 
mesh. We obtained these meshes in the following two ways:

•  We  used  a  synthetic  workload  to  generate  sparse  matrices  with  differing  dependency 
patterns. 

•  We  used  an  unstructured  mesh  that  was  generated  to  carry  out  an  aerodynamic 
simulation. 

The  next  two  sections  describe  the  details  of  the  synthetic  workload  and  the  unstructured 
mesh. 


4.1  Synthetic  Workload 

The  synthetic  workload  was  defined  in  the  following  way.  A  square  mesh  in  which  each 
point  was  linked  to  four  nearest  neighbors  was  incrementally  distorted.  Random  edges 
were  introduced  subject  to  the  constraint  that  in  the  new  mesh,  each  point  still  required 
information  from  four  other  mesh  points. 

Our  workload  generator  makes  the  following  assumptions: 
1. The problem domain consists of a 2-dimensional mesh of points which are numbered
using their row major or natural ordering.

2. Each point is initially connected to its four nearest neighbors.

3. Each link produced in the above step is examined; with probability q the link is
replaced by a link to a randomly chosen point.

4. We use the rand() function available on Unix System V.
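
The generator's steps can be sketched as below. Two things here are assumptions rather than facts from the text: mesh edges wrap around so that every point has exactly four neighbors, and Python's random.Random stands in for the System V rand() function. The function name is illustrative.

```python
import random


def generate_nbrs(n: int, q: float, seed: int = 0):
    """Return nbrs[i] = list of four neighbor indices for an n x n mesh."""
    rng = random.Random(seed)            # stand-in for System V rand()
    nbrs = []
    for idx in range(n * n):             # row-major (natural) ordering
        r, c = divmod(idx, n)
        links = [
            ((r - 1) % n) * n + c,       # north (wrapping at the edge: assumed)
            ((r + 1) % n) * n + c,       # south
            r * n + (c - 1) % n,         # west
            r * n + (c + 1) % n,         # east
        ]
        # With probability q, replace a link by one to a random point.
        links = [rng.randrange(n * n) if rng.random() < q else l for l in links]
        nbrs.append(links)
    return nbrs
```

With q = 0 the regular mesh is returned unchanged; larger q scatters a growing fraction of the references across the whole index range.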


Once generated, this connectivity information is stored in the integer array nbrs. We
used this workload generator to obtain a set of matrices generated from 256 x 256 meshes.
These matrices were used to perform a sequence of parallelized sparse matrix vector
multiplications using our hashed cache data migration scheme for non-local array references
to yold. We partitioned yold in various ways. First, let us define the terms:


•  p = total number of processors

•  MSize = matrix size = n x n

•  BlockSize = MSize / p


We  partitioned  yold  as  follows: 


1. partition the array indices in contiguous blocks of size BlockSize, i.e., processor i is
assigned indices i x BlockSize through (i + 1) x BlockSize - 1

2. partition the indices in an interleaved fashion, i.e., processor i is assigned indices
i, i + p, i + 2p, ..., i + (BlockSize - 1) x p.
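
The two partitionings amount to the following index arithmetic, shown as a small sketch. The function names and the convention of returning a (processor, local offset) pair are assumptions for the example.

```python
def blocked(idx: int, block_size: int):
    """Processor i holds indices i*BlockSize .. (i+1)*BlockSize - 1."""
    return divmod(idx, block_size)       # (processor, local offset)


def interleaved(idx: int, p: int):
    """Processor i holds indices i, i+p, i+2p, ..."""
    return idx % p, idx // p             # (processor, local offset)
```

For a 256 x 256 mesh on 32 processors, MSize is 65536 and BlockSize is 2048, so index 2048 is the first element of processor 1 under the blocked scheme, while under the interleaved scheme it belongs to processor 0.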

In  the  following  sections,  experiments  involving  matrices  generated  from  256  by  256 
meshes  in  which  yold  is  partitioned  into  blocks  will  be  labeled  by  the  descriptor  Blocked. 
When  yold  is  partitioned  in  an  interleaved  manner  and  the  same  sized  mesh  is  used,  the 
resulting  experiments  will  be  labeled  by  the  descriptor  Interleaved. 

We  now  proceed  to  describe  the  details  of  the  unstructured  mesh. 

4.2  Unstructured  Mesh 

We  used  an  unstructured  mesh  that  was  generated  to  carry  out  an  aerodynamic  simulation 
involving  a  multielement  airfoil  in  a  landing  configuration  [6]. 
The unstructured mesh consists of a highly non-uniform scattering of mesh points
joined  together  by  line  segments  to  form  a  set  of  triangular  elements.  The  algorithm 
used  is  the  Delaunay  Triangulation  algorithm  [13].  The  resulting  mesh  was  then  modified 
slightly  in  a  postprocessing  phase.  Details  of  this  mesh  generation  process  can  be  found 
in  [6].  An  illustration  of  the  mesh  used  is  shown  in  Figure  5. 

To  obtain  an  experimental  estimate  of  the  efficiency  of  the  executor  and  inspector  on 
the  Intel  iPSC/2,  we  carried  out  a  sequence  of  sparse  matrix-vector  multiplications  using 
the  unstructured  mesh  described  above. 


4.3  Performance  of  Off-Processor  Data  Access  Mechanisms 

In this section, we present data that gives an overview of some of the performance tradeoffs
between different variants of the hashed cache data access mechanism. A major motivation
for this set of experiments is to quantify the costs of accessing copies of off-processor data.
There are clearly a tremendous number of different schemes that could be used to store and
retrieve off-processor data, and we do not attempt to argue that the hashed cache method
is in any sense optimal. In Section 2.3 we suggested building a list of pointers to locations
in the hashed cache where off-processor data is to be cached; we called this mechanism
partial enumeration. When computations are carried out, copies of needed off-processor
data can be accessed by dereferencing pointers rather than by searching the hashed cache.
We present in Table 2 parallel efficiencies obtained using the hashed cache and partial
enumeration off-processor data access strategies. These parallel efficiencies only include
the executor times; they do not include preprocessing overhead. As one might expect,
the efficiencies obtained vary with the problem, but the use of partial enumeration led to
efficiencies which ranged from roughly 3 to 19 % higher than those obtained from direct
use of the hashed cache. The number of parallel iterations per inspector quantifies the
required preprocessing overhead. Let Tseq represent the time required by an optimized
sequential code on a single processor and P represent the number of processors. Then
the number of parallel iterations per inspector is defined as the ratio of the time required
for the inspector to the time required to carry out a perfectly parallelized iteration
(Tseq/P).

When we use only the hashed cache without enumeration, the number of parallel
iterations per inspector varies from 1.08 to 5.10. The number of executor iterations required to
carry out the inspector ranges from 0.66 to 0.99 when only the hashed cache is used, i.e.,
the inspector always takes less time to run than the executor. Since extra preprocessing is
needed to assemble the enumeration list, it is to be expected that the number of parallel
iterations per inspector is increased when partial enumeration is employed. The number
of parallel iterations per inspector in this case ranges from 1.10 to 6.09.
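
The two ranges quoted for the hashed cache are related by simple arithmetic. Assuming the parallel efficiency E is (Tseq/P) divided by the executor time per iteration, and reading the number of parallel iterations per inspector N as the inspector time divided by Tseq/P, the inspector cost measured in executor iterations is the product N x E. The sketch below applies this to two hashed cache rows of Table 2; the function name is illustrative.

```python
def inspector_cost_in_executor_iters(effic: float, iters_per_inspector: float) -> float:
    """T_insp / T_exec = (T_insp / (Tseq/P)) * ((Tseq/P) / T_exec)."""
    return iters_per_inspector * effic


# First Blocked row: efficiency 0.61, 1.08 parallel iterations per inspector.
print(round(inspector_cost_in_executor_iters(0.61, 1.08), 2))
# First Interleaved row: efficiency 0.23, 4.30 parallel iterations per inspector.
print(round(inspector_cost_in_executor_iters(0.23, 4.30), 2))
```

These two rows give 0.66 and 0.99, the endpoints of the range of executor iterations per inspector stated in the text.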

In Section 2.3, we described another scheme called full enumeration, in which all
memory accesses are enumerated. By enumerating all memory accesses, we can eliminate
the costs incurred by
Table 2

                       Hashed Cache            Partial Enumeration      Full Enumeration
Problem         q      effic   parallel iters  effic    parallel iters  effic         parallel iters
                               / inspector              / inspector                   / inspector

Blocked         0.0    0.61    1.08            0.63     1.10            0.80          1.36
Blocked         0.2    0.36    2.14            0.39     2.33            [illegible]   2.57
Blocked         0.4    0.30    2.80            0.35     3.00            0.39          3.27
Interleaved     0.0    0.23    4.30            0.2[?]   4.44            0.30          4.60
Interleaved     0.2    0.21    4.57            0.25     5.20            0.27          5.67
Interleaved     0.4    0.19    5.10            0.23     6.09            0.25          6.04
Unstructured           0.43    1.81            0.51     1.90            0.63          1.98


1. determining whether a memory reference is to a locally stored array element and,
if it is locally stored,

2. translating global array indices to locations in a processor's local storage.


The  storage  costs  incurred  for  this  data  access  scheme  are  high  enough  to  make  the  method 
impractical  in  many  situations. 

For the Blocked partitioning, with probability of deletion q = 0, most array elements
accessed are on the local processor. The hashed cache executor efficiency for this problem
is 0.61. Use of partial enumeration leads to a small improvement in efficiency (0.63); full
enumeration, on the other hand, leads to an efficiency of 0.80. For Blocked with q = 0.4,
approximately half of the array elements referenced are stored off-processor. The hashed
cache executor efficiency in that case is 0.30; the use of partial enumeration increases this
to 0.35 and full enumeration increases efficiency further to 0.39. The cost of accessing
off-processor data in the hashed cache is clearly a much more important factor with q =
0.4, which has a larger number of off-processor references.

We  also  present  in  Table  2  the  parallel  efficiencies  arising  from  the  application  derived 
unstructured  mesh  described  in  Section  4.2.  This  mesh  was  partitioned  into  32  strips, 
and  a  strip  was  assigned  to  each  processor.  The  strips  were  chosen  so  that  the  same 
number  of  floating  point  computations  were  required  by  each  processor.  The  parallel 
efficiency  obtained  when  we  used  only  the  hashed  cache  data  structure  was  0.43.  Partial 
enumeration  caused  this  efficiency  to  increase  to  0.51  and  full  enumeration  resulted  in  a 
further  increase  to  an  efficiency  of  0.63.  The  number  of  parallel  iterations  per  inspector 
varied  from  1.81  when  the  hashed  cache  was  used  to  1.98  when  full  enumeration  was 
employed. 

In Table 3 we present the executor computation times for the hashed cache, partial
enumeration and full enumeration data access methods. The executor computation times
include the time required to:
1. copy off-processor data into hashed cache


2.  distinguish  between  local  and  non-local  references  (for  hashed  cache  and  partial 
enumeration  case) 

3.  obtain  required  locally  stored  data 

4.  access  hashed  cache  (by  searching  hashed  cache  or  by  pointer  dereference  ) 

5.  perform  floating  point  computations 

6.  store  off-processor  writes  to  hashed  cache 

7.  copy  from  hashed  cache  data  needing  to  be  scattered  to  other  processors 

The executor computation time does not include the time required for carrying out
interprocessor communication.

For  a  given  sized  mesh,  all  of  the  synthetic  workload  problems  require  the  same  number 
of  operations.  Consider  first  the  Blocked  partitioning  generated  by  the  synthetic  workload 
generator.  When  the  hashed  cache  was  employed,  as  q  increased  from  0.0  to  0.4,  the 
executor  computation  time  increased  41  %.  Both  partial  and  full  enumeration  eliminate 
the  need  to  search  the  hashed  cache  for  each  non-local  memory  reference.  The  increase 
in  execution  time  as  q  went  from  0  to  0.4  for  the  Blocked  partitioning  was  consequently 
only 18 and 19 % for partial and full enumeration respectively.

When yold is assigned in an interleaved manner, most array references are non-local for
all values of q. Consider first the case in which the hashed cache is accessed directly.
For Interleaved with q equal to 0, the hashed cache executor computation time is 185 ms,
while for q equal to 0.4 it is 232 ms. This increase is consistent with the growth with q
in the number of distinct off-processor references. Furthermore, the number of hashed cache
links that must be traversed in order to access copies of off-processor data also increases
with q (as we will show in Table 4). A more modest increase in execution time with q was
noted in the partial enumeration and full enumeration cases; in these cases we expect only
increases in the cost of copying elements to and from the hashed cache to play a role in
the execution time increase. For the application derived unstructured mesh from Section
4.2, we measured executor computation times of 34 ms, 27 ms and 20 ms for hashed cache,
partial enumeration and full enumeration respectively.
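As a worked check, the relative increases discussed above can be recomputed from the executor computation times reported in Table 3. The sketch below uses only the entries for q equal to 0.0 and 0.4; the helper name `percent_increase` is an illustrative choice.

```python
def percent_increase(before_ms, after_ms):
    """Relative growth from before_ms to after_ms, in percent."""
    return 100.0 * (after_ms - before_ms) / before_ms


# (time at q = 0.0, time at q = 0.4) in ms, taken from Table 3.
times_ms = {
    ("Blocked", "hashed cache"): (122, 172),
    ("Interleaved", "hashed cache"): (185, 232),
    ("Interleaved", "partial enumeration"): (140, 156),
}

increases = {key: round(percent_increase(*pair)) for key, pair in times_ms.items()}

for (problem, scheme), pct in increases.items():
    print(f"{problem:12s} {scheme:20s} +{pct}%")
```

The Blocked hashed cache entry comes out at 41%, matching the figure quoted in the text. For Interleaved, the hashed cache time grows by about 25% while the partial enumeration time grows by about 11%, which is the more modest increase noted above.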

Further insight into the behavior of the hashed cache can be obtained from Table 4.
This table categorizes the array references made by processor 0: it gives the percentage of
references to locally stored data and to data accessed by traversing varying numbers of
links in the hashed cache. For instance, for Blocked with q equal to 0, 97% of array
references were to local array elements and 3% could be accessed by referring to the hashed
cache without traversing any links. On the other hand, for Interleaved with q equal to 0.4,
only 3% of array references were to locally stored data; 66%, 24%, 6% and 1% of array
references were to copies of off-processor data that required traversal of 0, 1, 2 and 3
links in the hashed cache respectively. In all cases, over two thirds of memory accesses
did not require traversing even one hashed cache link. It should be noted that all table
entries over 0.5 were rounded to the nearest integer, so row entries may not add to
exactly 100.

Table 3: Executor Computation Times for the Off-Processor Storage Schemes, Sparse MVM
for 32 processors

Problem        q     hashed cache   partial enumeration   full enumeration
                     time (ms)      time (ms)             time (ms)
Blocked        0.0   122            119                   93
Blocked        0.2   148            131                   [illegible]
Blocked        0.4   172            142                   [illegible]
Interleaved    0.0   185            140                   [illegible]
Interleaved    0.2   209            150                   127
Interleaved    0.4   232            156                   128
Unstructured         34             27                    20

Table 4: MVM: Data Accesses by Processor 0 during the executor phase

Problem        q     % local    % link 0   % link 1   % link 2   % link 3
                     accesses
Blocked        0.0   97         3          0          0          0
Blocked        0.2   79         18         3          0.25       0
Blocked        0.4   60         30         9          2          0.15
Interleaved    0.0   4          96         0          0          0
Interleaved    0.2   4          79         15         2          0.1
Interleaved    0.4   3          66         24         6          1
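The two-thirds observation can be checked directly against the Table 4 entries. The sketch below treats each row as percentages of all array references made by processor 0 and normalizes by the row sum, since rounded rows need not total exactly 100. The mean number of links per reference is a derived figure computed here for illustration, not a quantity reported in the measurements.

```python
# Rows of Table 4: (% local, % link 0, % link 1, % link 2, % link 3).
table4 = {
    ("Blocked", 0.0): (97, 3, 0, 0, 0),
    ("Blocked", 0.2): (79, 18, 3, 0.25, 0),
    ("Blocked", 0.4): (60, 30, 9, 2, 0.15),
    ("Interleaved", 0.0): (4, 96, 0, 0, 0),
    ("Interleaved", 0.2): (4, 79, 15, 2, 0.1),
    ("Interleaved", 0.4): (3, 66, 24, 6, 1),
}


def no_link_share(row):
    """Fraction of references that are local or found at the head of a chain."""
    return (row[0] + row[1]) / sum(row)


def mean_links(row):
    """Average links traversed per array reference (local references count as 0)."""
    return sum(k * pct for k, pct in enumerate(row[1:])) / sum(row)


for (problem, q), row in table4.items():
    print(f"{problem:12s} q={q:.1f}  no-link share={no_link_share(row):.2f}  "
          f"mean links={mean_links(row):.2f}")
```

Every row gives a no-link share above two thirds, with Interleaved at q equal to 0.4 the closest to the threshold.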

5  Conclusion 

In  this  paper,  we  have  proposed  a  scheme  for  automatically  migrating  data  across  the 
processors  of  a  distributed  memory  machine.  The  scheme  uses  a  hash  table  for  the  easy 
access  and  modification  of  off-processor  data.  Using  such  a  hashed  cache  allows  us  to 
support  parallel  loops  accessing  distributed  data  through  a  global  name  space. 

We  have  investigated  a  set  of  model  problems  to  characterize  the  performance  of  the 
hashed  cache  method.  This  model  problem  analysis  employed  a  synthetic  workload  and 
an unstructured mesh that allowed us to perform a detailed examination of the different
routines required to implement the hashed cache scheme. The data presented here suggests
that  this  scheme  may  be  a  practical  method  for  efficiently  supporting  data  migration  on 
distributed  machines. 